PROBLEM STATEMENT:¶
Customer churn is when a company’s customers stop doing business with that company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer. New business involves working leads through a sales funnel, using marketing and sales budgets to gain additional customers. Existing customers will often have a higher volume of service consumption and can generate additional customer referrals.
Customer retention can be achieved with good customer service and products. But the most effective way for a company to prevent attrition of customers is to truly know them. The vast volumes of data collected about customers can be used to build churn prediction models. Knowing who is most likely to defect means that a company can prioritise focused marketing efforts on that subset of their customer base.
Preventing customer churn is critically important to the telecommunications sector, as the barriers to entry for switching services are so low.
You will examine customer data from IBM Sample Data Sets with the aim of building and comparing several customer churn prediction models.¶
- Importing the required libraries for EDA, data wrangling, and data cleaning
import pandas as pd # for data wrangling purpose
import numpy as np # Basic computation library
import seaborn as sns # For Visualization
import matplotlib.pyplot as plt # plotting package
import warnings # Filtering warnings
warnings.filterwarnings('ignore')
# Importing the Customer Churn Analysis dataset CSV file using pandas
df=pd.read_csv('Telecom_customer_churn.csv')
df.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
# As we have 21 columns, let's group them by their datatype
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
Comment :¶
- We have 7,043 rows and 21 columns in this telecom dataset.
- The target variable 'Churn' has object datatype, making this a classification problem.
- One interesting entry under the object datatype is 'TotalCharges'. This feature is numerical in nature but stored as object, which implies the column contains string values or data-entry errors.
- 'SeniorCitizen' is a categorical variable stored as a numerical one, so we will treat it as categorical.
- In the end we have 3 numerical and 18 categorical variables, of which 'customerID' is unnecessary from an analytical and modelling viewpoint. We will drop the 'customerID' column.
We will group the variables into numerical and categorical lists to simplify further analysis. First, we drop the customerID column.¶
df.drop(['customerID'],axis=1,inplace=True)
Statistical Analysis¶
Before the statistical exploration of the data, we first check data integrity and missing values.
Data Integrity Check¶
Since the dataset is large, let's check for any entries that are repeated or duplicated.
df.duplicated().sum() # This will check the duplicate data for all columns.
22
There are 22 duplicate entries in the dataset. Let's drop them.
df.drop_duplicates(keep='last',inplace= True)
df.shape
(7021, 20)
df["TotalCharges"].unique()
array(['29.85', '1889.5', '108.15', ..., '346.45', '306.6', '6844.5'],
dtype=object)
Now check for whitespace, 'NA', or '-' entries in the dataset. Given its object datatype, we might find something in the TotalCharges column.¶
df[df["TotalCharges"].isin([' ','NA','-'])]
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | No | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | No | |
| 753 | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | No | |
| 936 | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | No | |
| 1082 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | No | |
| 1340 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | Yes | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | No | |
| 3331 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | No | |
| 3826 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | No | |
| 4380 | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | No | |
| 5218 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | No | |
| 6670 | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | Yes | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | No | |
| 6754 | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | Yes | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | No |
There may be whitespace entries in the TotalCharges column. Let's deal with them.
# Replacing whitespace with null values
df['TotalCharges']= df['TotalCharges'].replace(' ',np.nan)
# Converting the object datatype into float
df['TotalCharges']= df['TotalCharges'].astype(float)
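An equivalent one-step alternative is `pd.to_numeric` with `errors='coerce'`, which turns whitespace and other non-numeric strings into NaN and converts the dtype in one call. A minimal sketch on a tiny synthetic frame (`demo` is illustrative, not the notebook's data):

```python
import pandas as pd

# Tiny synthetic stand-in for the TotalCharges column (illustrative only)
demo = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# errors='coerce' converts non-numeric entries (including the blank string)
# to NaN and returns a float column, replacing replace(' ', np.nan) + astype.
demo["TotalCharges"] = pd.to_numeric(demo["TotalCharges"], errors="coerce")
```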
We have replaced the whitespace entries with nulls; now let's check the missing values.
We can impute the missing values in TotalCharges with either the mean or the median. We will decide on the imputation method after checking the distribution and outliers in the data.¶
plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
sns.boxplot(y='TotalCharges', data=df,color='cyan')
plt.ylabel('TotalCharges',fontsize=15)
plt.subplot(1,2,2)
sns.histplot(df['TotalCharges'], color='b', kde=True)  # distplot is deprecated in recent seaborn
plt.xlabel('TotalCharges',fontsize=15)
plt.tight_layout()
plt.show()
print("Mean of TotalCharges:",df['TotalCharges'].mean())
print("Median of TotalCharges:",df['TotalCharges'].median())
Mean of TotalCharges: 2290.3533880171185
Median of TotalCharges: 1410.25
Observation:¶
- The boxplot shows no outliers, so there is no mean-sensitivity issue here.
- The distribution plot shows that TotalCharges is right-skewed.
- The mean is greater than the median.
Considering the above observations, we can impute the missing values with the mean.
Imputation of Missing value in TotalCharges with Mean¶
df['TotalCharges']=df['TotalCharges'].fillna(df['TotalCharges'].mean())
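The decision rule above can be sketched as a small helper. `impute_center` and the skew threshold are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical helper: use the mean for roughly symmetric data,
# fall back to the median when the distribution is strongly skewed.
def impute_center(s: pd.Series, skew_threshold: float = 1.0) -> pd.Series:
    fill = s.mean() if abs(s.skew()) < skew_threshold else s.median()
    return s.fillna(fill)

s = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0])
filled = impute_center(s)  # symmetric sample, so the mean (11.5) is used
```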
Checking for Null values after Imputation¶
plt.figure(figsize=(9,6))
sns.heatmap(df.isnull(),cmap="cool_r")
plt.show()
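The heatmap can be complemented by a numeric check per column. The `demo` frame below is a synthetic stand-in for `df`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, np.nan, 3.0]})

# Count nulls per column; after imputation this should be all zeros for df.
null_counts = demo.isnull().sum()
remaining = null_counts[null_counts > 0]
```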
Statistical Matrix¶
df.describe().T.style.background_gradient(subset=['mean','std','50%','count'], cmap='RdPu')
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SeniorCitizen | 7021.000000 | 0.162512 | 0.368947 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| tenure | 7021.000000 | 32.469449 | 24.534965 | 0.000000 | 9.000000 | 29.000000 | 55.000000 | 72.000000 |
| MonthlyCharges | 7021.000000 | 64.851894 | 30.069001 | 18.250000 | 35.750000 | 70.400000 | 89.900000 | 118.750000 |
| TotalCharges | 7021.000000 | 2290.353388 | 2265.044136 | 18.800000 | 411.150000 | 1410.250000 | 3801.700000 | 8684.800000 |
# df[Categorical].describe().T
Numerical=df.select_dtypes(exclude="object")
Categorical=df.select_dtypes(include="object")
for i in Categorical:
    print(df[i].value_counts())
    print("="*100)
gender
Male      3541
Female    3480
Name: count, dtype: int64
====================================================================================================
Partner
No     3619
Yes    3402
Name: count, dtype: int64
====================================================================================================
Dependents
No     4911
Yes    2110
Name: count, dtype: int64
====================================================================================================
PhoneService
Yes    6339
No      682
Name: count, dtype: int64
====================================================================================================
MultipleLines
No                  3368
Yes                 2971
No phone service     682
Name: count, dtype: int64
====================================================================================================
InternetService
Fiber optic    3090
DSL            2419
No             1512
Name: count, dtype: int64
====================================================================================================
OnlineSecurity
No                     3490
Yes                    2019
No internet service    1512
Name: count, dtype: int64
====================================================================================================
OnlineBackup
No                     3080
Yes                    2429
No internet service    1512
Name: count, dtype: int64
====================================================================================================
DeviceProtection
No                     3087
Yes                    2422
No internet service    1512
Name: count, dtype: int64
====================================================================================================
TechSupport
No                     3465
Yes                    2044
No internet service    1512
Name: count, dtype: int64
====================================================================================================
StreamingTV
No                     2802
Yes                    2707
No internet service    1512
Name: count, dtype: int64
====================================================================================================
StreamingMovies
No                     2777
Yes                    2732
No internet service    1512
Name: count, dtype: int64
====================================================================================================
Contract
Month-to-month    3853
Two year          1695
One year          1473
Name: count, dtype: int64
====================================================================================================
PaperlessBilling
Yes    4161
No     2860
Name: count, dtype: int64
====================================================================================================
PaymentMethod
Electronic check             2359
Mailed check                 1596
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: count, dtype: int64
====================================================================================================
Churn
No     5164
Yes    1857
Name: count, dtype: int64
====================================================================================================
sns.set_palette('hsv')
plt.figure(figsize=(20,40), facecolor='white')
plotnumber =1
for i in Categorical:
    if plotnumber <= 16:
        ax = plt.subplot(4,4,plotnumber)
        sns.countplot(x=df[i])
        plt.xlabel(i,fontsize=20)
    plotnumber += 1
plt.show()
Now let's explore the features one by one, beginning with the target feature.
Target Variable Churn¶
sns.set_palette('husl')
f,ax=plt.subplots(1,2,figsize=(15,8))
df['Churn'].value_counts().plot.pie(explode=[0,0.1],autopct='%3.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('Churn Distribution', fontsize=22,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='Churn',data=df,ax=ax[1])
ax[1].set_title('Churn Distribution',fontsize=22,fontweight ='bold')
ax[1].set_xlabel("Churn",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=18,fontweight ='bold')
plt.show()
Comment :¶
- 26.4% of customers churned in the last month, which is quite a high number. Since Churn is our target variable, this makes the dataset imbalanced.
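The imbalance can be quantified directly. The series below is a synthetic stand-in with roughly the same 26/74 split as the real Churn column:

```python
import pandas as pd

# Synthetic stand-in for df['Churn'] with the same approximate split
churn = pd.Series(["No"] * 74 + ["Yes"] * 26)

rate = churn.value_counts(normalize=True)
imbalance = rate["No"] / rate["Yes"]  # majority-to-minority ratio, ~2.8:1
```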
Let's explore the independent features to figure out where customers are unsatisfied and what customers need or prefer in this cutting-edge competition.
Gender vs Churn: Is there any trend between gender and churn, or any impact of gender on churn?¶
sns.set_palette('husl')
fig,ax=plt.subplots(1,2,figsize=(16,8))
df['gender'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('Gender', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='gender',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Gender-wise Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Churn ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
pd.crosstab(df['gender'],df["Churn"],margins=True).style.background_gradient(cmap='summer_r')
| Churn | No | Yes | All |
|---|---|---|---|
| gender | |||
| Female | 2546 | 934 | 3480 |
| Male | 2618 | 923 | 3541 |
| All | 5164 | 1857 | 7021 |
plt.figure(figsize=(6, 6))
labels = ["Churn: Yes","Churn: No"]
values = [1857, 5164]  # counts from the crosstab above
labels_gender = ["F","M","F","M"]
sizes_gender = [934, 923, 2546, 2618]  # F/M churned, then F/M retained
colors = ['#ff6666', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3)
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Churn Distribution w.r.t Gender: Male(M), Female(F)', fontsize=15, y=1.1)
# show plot
plt.axis('equal')
plt.tight_layout()
plt.show()
sns.set_palette('husl')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['SeniorCitizen'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('Senior Citizen Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='SeniorCitizen',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Senior Citizen-wise Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Churn ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
Only 16.3% of the customers are senior citizens; most customers in the data are younger people.
pd.crosstab([df.gender,df.SeniorCitizen],df["Churn"],margins=True).style.background_gradient(cmap='summer_r')
| Churn | No | Yes | All | |
|---|---|---|---|---|
| gender | SeniorCitizen | |||
| Female | 0 | 2218 | 695 | 2913 |
| 1 | 328 | 239 | 567 | |
| Male | 0 | 2280 | 687 | 2967 |
| 1 | 338 | 236 | 574 | |
| All | 5164 | 1857 | 7021 |
# Comparing tenure and SeniorCitizen
plt.title("Comparison between tenure and SeniorCitizen")
sns.stripplot(x = "SeniorCitizen",y="tenure",data = df)
plt.show()
Around 16% of customers are senior citizens, and from the countplot we can see they have a higher tendency to churn.
There is no significant relation between SeniorCitizen and tenure.
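The per-group churn rates behind this reading can be computed with a groupby. The frame below is a tiny synthetic example, not the real data:

```python
import pandas as pd

demo = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1],
    "Churn":         ["No", "No", "No", "Yes", "Yes", "No"],
})

# Mean of the boolean flag gives the churn rate within each group.
rate = (demo["Churn"] == "Yes").groupby(demo["SeniorCitizen"]).mean()
```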
Effect of Partner and Dependents on Churn¶
sns.set_palette('Set1')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['Partner'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('Partner Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='Partner',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of Partner on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Partner ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
sns.set_palette('rainbow')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['Dependents'].value_counts().plot.pie(explode=[0,0.1],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('Dependents Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='Dependents',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of Dependents on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Dependents ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
Observation-¶
- Customers who have a partner are less likely to churn.
- Almost 30% of customers have dependents, and they are also less likely to churn compared to the remaining 70%.
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('coolwarm')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['StreamingTV'].value_counts().plot.pie(explode=[0.03,0.03,0.03],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('StreamingTV Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='StreamingTV',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of StreamingTV on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("StreamingTV ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('tab10')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['InternetService'].value_counts().plot.pie(explode=[0.03,0.03,0.03],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('InternetService Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x='InternetService',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of InternetService on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("InternetService ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
plt.figure(figsize=(8,5))
sns.scatterplot(x="InternetService", y='MonthlyCharges',data=df,hue="Churn")
plt.show()
44% of customers prefer fiber optic as their internet service, and surprisingly we find a high churn rate among them.
Monthly charges are higher for fiber-optic customers than for DSL customers, so high charges may be a reason for customer churn.
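The charge comparison can be made explicit with a groupby mean. The values below are synthetic; the real numbers come from `df`:

```python
import pandas as pd

demo = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "DSL"],
    "MonthlyCharges":  [90.0, 80.0, 50.0, 40.0],
})

# Average monthly charge per service type
avg_charge = demo.groupby("InternetService")["MonthlyCharges"].mean()
```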
plt.rcParams["figure.autolayout"] = True
sns.set_palette('rainbow_r')
f, ax = plt.subplots(1, 2, figsize=(16, 8))
df['StreamingMovies'].value_counts().plot.pie(
explode=[0.03, 0.03, 0.03],
autopct='%2.1f%%',
ax=ax[0],
shadow=True
)
ax[0].set_title('StreamingMovies Distribution', fontsize=20, fontweight='bold')
ax[0].set_ylabel('')
sns.countplot(x='StreamingMovies', hue='Churn', data=df, ax=ax[1])
ax[1].set_title('Effect of StreamingMovies on Churn Tendency', fontsize=20, fontweight='bold')
ax[1].set_xlabel("StreamingMovies", fontsize=18, fontweight='bold')
plt.xticks(fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Churn tendency is almost the same for customers who stream movies and those who do not.
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('husl')
f, ax = plt.subplots(1, 2, figsize=(16, 8))
df['Contract'].value_counts().plot.pie(
explode=[0.03, 0.03, 0.03],
autopct='%2.1f%%',
ax=ax[0],
shadow=True
)
ax[0].set_title('Contract Distribution', fontsize=20, fontweight='bold')
ax[0].set_ylabel('')
sns.countplot(x='Contract', hue='Churn', data=df, ax=ax[1])
ax[1].set_title('Effect of Contract on Churn Tendency', fontsize=20, fontweight='bold')
ax[1].set_xlabel("Contract", fontsize=18, fontweight='bold')
plt.xticks(fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
plt.figure(figsize=(8,5))
sns.scatterplot(x="Contract", y='MonthlyCharges',data=df,hue="Churn")
plt.show()
Almost 55% of customers prefer a month-to-month contract, and we also find a high churn rate among these customers.
We did not find any relation between monthly charges and contract length.
plt.rcParams["figure.autolayout"] = True
sns.set_palette('gist_earth')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['PaperlessBilling'].value_counts().plot.pie(explode=[0.03,0.03],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('PaperlessBilling Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x = 'PaperlessBilling',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of PaperlessBilling on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("PaperlessBilling ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=14,fontweight ='bold')
plt.tight_layout()
plt.show()
60% of customers prefer paperless billing.
Customers who prefer paperless billing have a higher churn rate.
#plt.rcParams["figure.autolayout"] = True
sns.set_palette('coolwarm')
f,ax=plt.subplots(1,2,figsize=(16,8))
df['PaymentMethod'].value_counts().plot.pie(explode=[0.03,0.03,0.03,0.03],autopct='%2.1f%%',
ax=ax[0],shadow=True)
ax[0].set_title('Payment Method Distribution', fontsize=20,fontweight ='bold')
ax[0].set_ylabel('')
sns.countplot(x = 'PaymentMethod',hue="Churn",data=df,ax=ax[1])
ax[1].set_title('Effect of PaymentMethod on Churn Tendency',fontsize=20,fontweight ='bold')
ax[1].set_xlabel("Payment Method ",fontsize=18,fontweight ='bold')
plt.xticks(fontsize=12,rotation=15)
plt.tight_layout()
plt.show()
We can see a high attrition tendency among customers who pay by electronic check.
sns.set_palette('tab20_r')
fig , ax=plt.subplots(2,2, figsize=(15,10))
for i,col in enumerate(["MonthlyCharges","TotalCharges"]):
    sns.scatterplot(ax=ax[0,i],x="tenure", y=col,data=df,hue="Churn")
    sns.lineplot(ax=ax[1,i],x="tenure", y=col,data=df,hue="Churn")
plt.show()
Observation:¶
- Monthly charges are higher for customers who churn than for the rest.
- The same goes for total charges: they are higher for customers who churn.
sns.pairplot(df,hue="Churn",palette="Dark2")
plt.show()
Encoding categorical data¶
df.columns.to_series().groupby(df.dtypes).groups
{int64: ['SeniorCitizen', 'tenure'], float64: ['MonthlyCharges', 'TotalCharges'], object: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']}
df.head()
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
Numerical =['tenure','MonthlyCharges', 'TotalCharges']
# df["--"] = df["--"].astype("object")
Category =['gender', 'Partner','PhoneService', 'Dependents', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies',
'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
# Using Label Encoder on categorical variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Categorical=df.select_dtypes(include="object")
for i in Categorical:
    df[i] = le.fit_transform(df[i])
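One caveat of the loop above: refitting a single `LabelEncoder` keeps only the last column's mapping. Keeping one encoder per column preserves the ability to invert the encoding later. A sketch on a synthetic frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# One fitted encoder per column so each mapping can be inverted later.
encoders = {}
for col in demo.select_dtypes(include="object"):
    encoders[col] = LabelEncoder()
    demo[col] = encoders[col].fit_transform(demo[col])
```

`LabelEncoder` sorts classes alphabetically, so Month-to-month encodes to 0, One year to 1, and Two year to 2, and `encoders["Contract"].inverse_transform([0])` recovers the original string.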
df.head()
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.85 | 29.85 | 0 |
| 1 | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 56.95 | 1889.50 | 0 |
| 2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 53.85 | 108.15 | 1 |
| 3 | 1 | 0 | 0 | 0 | 45 | 0 | 1 | 0 | 2 | 0 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 42.30 | 1840.75 | 0 |
| 4 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.70 | 151.65 | 1 |
Feature selection and Engineering¶
1. Outliers Detection and Removal¶
plt.figure(figsize=(18,10),facecolor='white')
plotnumber=1
for column in Numerical:
    if plotnumber<=4:
        ax=plt.subplot(2,2,plotnumber)
        sns.boxplot(x=df[column],color='g')
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()
From the boxplots we can see that no outliers exist among the continuous features. Note that the z-score filter below is applied to every column, including the encoded categorical ones, which is why it still removes some rows.
Outliers removal using Zscore method¶
from scipy.stats import zscore
z = np.abs(zscore(df))
df1 = df[(z<3).all(axis = 1)]
print ("Shape of the dataframe before removing outliers: ", df.shape)
print ("Shape of the dataframe after removing outliers: ", df1.shape)
print ("Percentage of data loss post outlier removal: ", (df.shape[0]-df1.shape[0])/df.shape[0]*100)
df=df1.copy() # reassigning the changed dataframe name to our original dataframe name
Shape of the dataframe before removing outliers:  (7021, 20)
Shape of the dataframe after removing outliers:  (6339, 20)
Percentage of data loss post outlier removal:  9.713715994872524
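The IQR rule is a common cross-check on the z-score filter. A sketch on synthetic numbers (the 1.5 multiplier is the conventional choice, not from the notebook):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
# Keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
kept = s[s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```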
df['PhoneService'].unique()
array([1])
df.drop(['PhoneService'],axis=1,inplace=True)
df.skew() # acceptable skew range: roughly -0.5 to 0.5
gender             -0.012939
SeniorCitizen       1.819335
Partner             0.049562
Dependents          0.871194
tenure              0.233517
MultipleLines       0.125532
InternetService     0.051965
OnlineSecurity      0.421216
OnlineBackup        0.166121
DeviceProtection    0.181524
TechSupport         0.408970
StreamingTV        -0.005185
StreamingMovies    -0.012505
Contract            0.624212
PaperlessBilling   -0.388673
PaymentMethod      -0.165613
MonthlyCharges     -0.404120
TotalCharges        0.895850
Churn               1.058644
dtype: float64
num=["tenure","MonthlyCharges","TotalCharges"]
2. Skewness of features¶
plt.figure(figsize=(20,5),facecolor='white')
sns.set_palette('plasma')
plotnum=1
for col in num:
    if plotnum<=4:
        plt.subplot(2,2,plotnum)
        sns.histplot(df[col], kde=True)  # distplot is deprecated in recent seaborn
        plt.xlabel(col,fontsize=20)
    plotnum+=1
plt.show()
Skewness is an important consideration for continuous features.
Skewness is not relevant for discrete numerical or categorical features, so we will ignore any skewness present in those.
'tenure', 'MonthlyCharges', and 'TotalCharges' are the continuous numerical features in the dataset.
Of these, TotalCharges is skewed, which we will transform here.
df['TotalCharges'] = np.log1p(df['TotalCharges'])
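The effect of the transform can be verified by comparing skew before and after. The right-skewed values below are synthetic stand-ins for TotalCharges:

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, 50.0, 100.0, 400.0, 1500.0, 6000.0, 8000.0])

before = s.skew()           # strongly positive (right-skewed)
after = np.log1p(s).skew()  # much closer to zero after log1p
```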
3. Correlation¶
df.corr()
| gender | SeniorCitizen | Partner | Dependents | tenure | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gender | 1.000000 | -0.005846 | -0.002207 | 0.015722 | 0.001891 | -0.006391 | 0.000983 | -0.016826 | -0.009353 | -0.003121 | -0.009769 | -0.005624 | -0.002760 | 0.000674 | -0.018131 | 0.021961 | -0.011639 | -0.006783 | -0.011391 |
| SeniorCitizen | -0.005846 | 1.000000 | 0.013943 | -0.213486 | 0.017647 | 0.152954 | -0.039479 | -0.123668 | -0.020710 | -0.023590 | -0.144694 | 0.028453 | 0.047062 | -0.141107 | 0.155193 | -0.041891 | 0.238426 | 0.111597 | 0.149599 |
| Partner | -0.002207 | 0.013943 | 1.000000 | 0.453409 | 0.382932 | 0.147545 | -0.004099 | 0.151348 | 0.154738 | 0.167390 | 0.132266 | 0.133353 | 0.127676 | 0.297393 | -0.010458 | -0.147854 | 0.088571 | 0.337926 | -0.153262 |
| Dependents | 0.015722 | -0.213486 | 0.453409 | 1.000000 | 0.159194 | -0.028535 | 0.053701 | 0.146427 | 0.090389 | 0.082944 | 0.130166 | 0.048859 | 0.023932 | 0.242286 | -0.106970 | -0.037411 | -0.131791 | 0.084275 | -0.158628 |
| tenure | 0.001891 | 0.017647 | 0.382932 | 0.159194 | 1.000000 | 0.358098 | -0.034932 | 0.326356 | 0.377187 | 0.367678 | 0.324457 | 0.282710 | 0.292966 | 0.674586 | 0.002370 | -0.361878 | 0.242184 | 0.827354 | -0.348882 |
| MultipleLines | -0.006391 | 0.152954 | 0.147545 | -0.028535 | 0.358098 | 1.000000 | -0.107675 | 0.006752 | 0.125043 | 0.130055 | 0.011287 | 0.187307 | 0.193380 | 0.114261 | 0.174017 | -0.183244 | 0.454819 | 0.458583 | 0.042438 |
| InternetService | 0.000983 | -0.039479 | -0.004099 | 0.053701 | -0.034932 | -0.107675 | 1.000000 | -0.027406 | 0.030417 | 0.049829 | -0.022841 | 0.099513 | 0.094169 | 0.115528 | -0.164085 | 0.096674 | -0.470605 | -0.260767 | -0.058968 |
| OnlineSecurity | -0.016826 | -0.123668 | 0.151348 | 0.146427 | 0.326356 | 0.006752 | -0.027406 | 1.000000 | 0.198167 | 0.173275 | 0.283252 | 0.046717 | 0.062345 | 0.367667 | -0.154346 | -0.089597 | -0.071808 | 0.207795 | -0.289182 |
| OnlineBackup | -0.009353 | -0.020710 | 0.154738 | 0.090389 | 0.377187 | 0.125043 | 0.030417 | 0.198167 | 1.000000 | 0.195604 | 0.210090 | 0.151646 | 0.139587 | 0.286126 | -0.019141 | -0.126394 | 0.110079 | 0.310079 | -0.201206 |
| DeviceProtection | -0.003121 | -0.023590 | 0.167390 | 0.082944 | 0.367678 | 0.130055 | 0.049829 | 0.173275 | 0.195604 | 1.000000 | 0.241956 | 0.278088 | 0.284397 | 0.342751 | -0.040732 | -0.132907 | 0.154859 | 0.318027 | -0.176171 |
| TechSupport | -0.009769 | -0.144694 | 0.132266 | 0.130166 | 0.324457 | 0.011287 | -0.022841 | 0.283252 | 0.210090 | 0.241956 | 1.000000 | 0.174169 | 0.179502 | 0.417344 | -0.107286 | -0.104360 | -0.022495 | 0.225559 | -0.279455 |
| StreamingTV | -0.005624 | 0.028453 | 0.133353 | 0.048859 | 0.282710 | 0.187307 | 0.099513 | 0.046717 | 0.151646 | 0.278088 | 0.174169 | 1.000000 | 0.429865 | 0.226185 | 0.092405 | -0.094395 | 0.326687 | 0.315502 | -0.033758 |
| StreamingMovies | -0.002760 | 0.047062 | 0.127676 | 0.023932 | 0.292966 | 0.193380 | 0.094169 | 0.062345 | 0.139587 | 0.284397 | 0.179502 | 0.429865 | 1.000000 | 0.232550 | 0.071674 | -0.100094 | 0.328388 | 0.326481 | -0.039644 |
| Contract | 0.000674 | -0.141107 | 0.297393 | 0.242286 | 0.674586 | 0.114261 | 0.115528 | 0.367667 | 0.286126 | 0.342751 | 0.417344 | 0.226185 | 0.232550 | 1.000000 | -0.179155 | -0.222411 | -0.106646 | 0.427736 | -0.396873 |
| PaperlessBilling | -0.018131 | 0.155193 | -0.010458 | -0.106970 | 0.002370 | 0.174017 | -0.164085 | -0.154346 | -0.019141 | -0.040732 | -0.107286 | 0.092405 | 0.071674 | -0.179155 | 1.000000 | -0.062858 | 0.377321 | 0.151738 | 0.195364 |
| PaymentMethod | 0.021961 | -0.041891 | -0.147854 | -0.037411 | -0.361878 | -0.183244 | 0.096674 | -0.089597 | -0.126394 | -0.132907 | -0.104360 | -0.094395 | -0.100094 | -0.222411 | -0.062858 | 1.000000 | -0.195322 | -0.363128 | 0.103054 |
| MonthlyCharges | -0.011639 | 0.238426 | 0.088571 | -0.131791 | 0.242184 | 0.454819 | -0.470605 | -0.071808 | 0.110079 | 0.154859 | -0.022495 | 0.326687 | 0.328388 | -0.106646 | 0.377321 | -0.195322 | 1.000000 | 0.579093 | 0.218422 |
| TotalCharges | -0.006783 | 0.111597 | 0.337926 | 0.084275 | 0.827354 | 0.458583 | -0.260767 | 0.207795 | 0.310079 | 0.318027 | 0.225559 | 0.315502 | 0.326481 | 0.427736 | 0.151738 | -0.363128 | 0.579093 | 1.000000 | -0.223677 |
| Churn | -0.011391 | 0.149599 | -0.153262 | -0.158628 | -0.348882 | 0.042438 | -0.058968 | -0.289182 | -0.201206 | -0.176171 | -0.279455 | -0.033758 | -0.039644 | -0.396873 | 0.195364 | 0.103054 | 0.218422 | -0.223677 | 1.000000 |
plt.figure(figsize=(25,15))
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, square=True, fmt='0.3f',
annot_kws={'size':10}, cmap="gist_stern")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
plt.figure(figsize = (18,6))
df.corr()['Churn'].drop(['Churn']).sort_values(ascending=False).plot(kind='bar',color = 'purple')
plt.xlabel('Features',fontsize=15,fontweight='bold')
plt.ylabel('Churn',fontsize=15,fontweight='bold')
plt.title('Correlation of features with Target Variable Churn',fontsize = 20,fontweight='bold')
plt.show()
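The ranked correlations plotted above can also be read off numerically. A minimal sketch on a hypothetical miniature frame (the real analysis uses the full `df`; `sort_values(key=abs, ...)` assumes pandas ≥ 1.1):

```python
import pandas as pd

# Hypothetical miniature frame standing in for the encoded churn data
demo = pd.DataFrame({
    "tenure":         [1, 34, 2, 45, 8, 22],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10],
    "Churn":          [0, 0, 1, 0, 1, 1],
})

# Correlation of each feature with the target, strongest magnitude first
corr_with_churn = (demo.corr()["Churn"]
                   .drop("Churn")
                   .sort_values(key=abs, ascending=False))
print(corr_with_churn)
```

As in the bar chart, tenure correlates negatively with churn and monthly charges positively.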
4. Balancing the Imbalanced Target Feature¶
df.Churn.value_counts()
Churn
0    4652
1    1687
Name: count, dtype: int64
Since the target variable is imbalanced, we need to balance it before training, otherwise the model will favour the majority class.
Balancing using SMOTE¶
from imblearn.over_sampling import SMOTE
# Splitting data into independent features (X) and target (Y)
X = df.drop(['Churn'], axis=1)
Y = df['Churn']
# Oversampling the minority class using SMOTE
oversample = SMOTE()
X, Y = oversample.fit_resample(X, Y)
Y.value_counts()
Churn
0    4652
1    4652
Name: count, dtype: int64
We have resolved the class imbalance: both classes now have the same number of samples, which helps prevent the model from becoming biased towards the majority class.
Standard Scaling¶
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X_scale = scaler.fit_transform(X)
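A similar leakage consideration applies to scaling: fitting the scaler on the full data uses test-set statistics. A minimal sketch that fits on the training portion only and reuses those statistics for the test set (synthetic stand-in arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train_demo)  # fit statistics on train only
X_test_s = scaler.transform(X_test_demo)        # reuse train statistics

print(np.round(X_train_s.mean(axis=0), 6))  # ~0 per column
print(np.round(X_train_s.std(axis=0), 6))   # ~1 per column
```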
5. Checking Multicollinearity between features using variance_inflation_factor¶
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF values"] = [variance_inflation_factor(X_scale,i) for i in range(len(X.columns))]
vif["Features"] = X.columns
vif
| VIF values | Features | |
|---|---|---|
| 0 | 1.014034 | gender |
| 1 | 1.097927 | SeniorCitizen |
| 2 | 1.541789 | Partner |
| 3 | 1.428879 | Dependents |
| 4 | 6.495017 | tenure |
| 5 | 1.428823 | MultipleLines |
| 6 | 1.470469 | InternetService |
| 7 | 1.345447 | OnlineSecurity |
| 8 | 1.250707 | OnlineBackup |
| 9 | 1.318419 | DeviceProtection |
| 10 | 1.396276 | TechSupport |
| 11 | 1.513197 | StreamingTV |
| 12 | 1.488277 | StreamingMovies |
| 13 | 2.540526 | Contract |
| 14 | 1.181381 | PaperlessBilling |
| 15 | 1.172448 | PaymentMethod |
| 16 | 3.176709 | MonthlyCharges |
| 17 | 5.959530 | TotalCharges |
All independent features have VIF values below the commonly used threshold of 10 (tenure and TotalCharges are the highest, at roughly 6.5 and 6.0), so multicollinearity is not severe enough to require dropping features.
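Had any VIF exceeded the limit, one common remedy is to drop the worst offender and recompute until all values pass. A sketch under stated assumptions: the `drop_high_vif` helper and the tiny synthetic frame (where `c` nearly duplicates `a`) are illustrative, not part of the original analysis:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = rng.normal(size=200)
frame = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + 0.05 * rng.normal(size=200),  # nearly duplicates "a"
})

def drop_high_vif(X, threshold=10.0):
    """Repeatedly drop the feature with the largest VIF above threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() <= threshold:
            return X, vifs
        X = X.drop(columns=[vifs.idxmax()])

reduced, final_vifs = drop_high_vif(frame)
print(list(reduced.columns))
```

One of the near-duplicate columns is removed; the independent column `b` survives.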
Machine Learning Model Building¶
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report
X_train, X_test, Y_train, Y_test = train_test_split(X_scale, Y, random_state=99, test_size=.3)
# print('Training feature matrix size:',X_train.shape)
# print('Training target vector size:',Y_train.shape)
# print('Test feature matrix size:',X_test.shape)
# print('Test target vector size:',Y_test.shape)
Finding best Random state¶
maxAccu = 0
maxRS = 0
for i in range(1, 250):
    X_train, X_test, Y_train, Y_test = train_test_split(X_scale, Y, test_size=0.3, random_state=i)
    lr = LogisticRegression()
    lr.fit(X_train, Y_train)
    y_pred = lr.predict(X_test)
    acc = accuracy_score(Y_test, y_pred)
    if acc > maxAccu:
        maxAccu = acc
        maxRS = i
print('Best accuracy is', maxAccu, 'on Random_state', maxRS)
Best accuracy is 0.8130372492836676 on Random_state 11
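Scanning random states for the highest hold-out accuracy effectively tunes on the test split, so the reported number can be optimistic. A stratified cross-validated estimate is a steadier measure; a sketch on synthetic data (the `make_classification` frame is a stand-in for the scaled churn features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled churn features
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)
print(round(scores.mean(), 3), round(scores.std(), 3))
```

The mean ± standard deviation across folds summarises performance without depending on one lucky split.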
X_train, X_test, Y_train, Y_test = train_test_split(X_scale, Y, random_state=90, test_size=.3)
lr=LogisticRegression()
lr.fit(X_train,Y_train)
y_pred=lr.predict(X_test)
print(classification_report(Y_test, y_pred))
print()
print(confusion_matrix(Y_test, y_pred))
precision recall f1-score support
0 0.85 0.76 0.80 1415
1 0.78 0.86 0.82 1377
accuracy 0.81 2792
macro avg 0.82 0.81 0.81 2792
weighted avg 0.82 0.81 0.81 2792
[[1078 337]
[ 188 1189]]
Finding the Optimal Value of n_neighbors for KNN¶
from sklearn import neighbors
from math import sqrt
from sklearn.metrics import mean_squared_error
rmse_val = [] #to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsClassifier(n_neighbors=K)
    model.fit(X_train, Y_train)            # fit the model
    y_pred = model.predict(X_test)         # make predictions on the test set
    error = sqrt(mean_squared_error(Y_test, y_pred))  # calculate RMSE
    rmse_val.append(error)                 # store the RMSE value
    print('RMSE value for k= ', K, 'is:', error)
RMSE value for k= 1 is: 0.4791497962230918
RMSE value for k= 2 is: 0.49351090017950827
RMSE value for k= 3 is: 0.45735003491992954
RMSE value for k= 4 is: 0.4534174594676999
RMSE value for k= 5 is: 0.4565662296568125
RMSE value for k= 6 is: 0.45578107648827826
RMSE value for k= 7 is: 0.46047191440320734
RMSE value for k= 8 is: 0.45617382199543793
RMSE value for k= 9 is: 0.4577414342512293
RMSE value for k= 10 is: 0.46008283793795546
RMSE value for k= 11 is: 0.45617382199543793
RMSE value for k= 12 is: 0.4549945684363609
RMSE value for k= 13 is: 0.45420669846267875
RMSE value for k= 14 is: 0.4494504763141606
RMSE value for k= 15 is: 0.45420669846267875
RMSE value for k= 16 is: 0.4518348457054815
RMSE value for k= 17 is: 0.4514383253608233
RMSE value for k= 18 is: 0.45262684428999544
RMSE value for k= 19 is: 0.4502466691017817
RMSE value for k= 20 is: 0.45223101837756335
# Plotting the RMSE values against k values
plt.figure(figsize=(8, 6))
plt.plot(range(1, 21), rmse_val, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='green', markersize=10)
plt.xlabel('k')
plt.ylabel('RMSE')
plt.show()
Comment¶
The RMSE reaches its minimum at k=14 (≈0.4495) and stays nearly flat for larger k; k=18 gives a comparable error (≈0.4526) and is the value used for the KNN models below.
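As an alternative to the RMSE loop, k can be chosen by cross-validated grid search over `n_neighbors`. A sketch on synthetic stand-in data (not the churn features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=3)

grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 21))}, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

This selects k by cross-validated accuracy instead of test-set RMSE, avoiding repeated peeks at the hold-out data.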
Applying Other Classification Algorithms¶
model = [LogisticRegression(),
         SVC(),
         GaussianNB(),
         DecisionTreeClassifier(),
         KNeighborsClassifier(n_neighbors=18),
         RandomForestClassifier(),
         ExtraTreesClassifier()]
for m in model:
    m.fit(X_train, Y_train)
    y_pred = m.predict(X_test)
    print('\033[1m' + 'Classification ML Algorithm Evaluation Matrix', m, 'is' + '\033[0m')
    print('\n')
    print('\033[1m' + 'Accuracy Score :' + '\033[0m\n', accuracy_score(Y_test, y_pred))
    print('\n')
    print('\033[1m' + 'Confusion matrix :' + '\033[0m \n', confusion_matrix(Y_test, y_pred))
    print('\n')
    print('\033[1m' + 'Classification Report :' + '\033[0m \n', classification_report(Y_test, y_pred))
    print('\n')
    print('============================================================================================================')
Results of the loop above (accuracy rounded to 4 decimal places; support is class 0 = 1415, class 1 = 1377 for every model):

| Model | Accuracy | Confusion matrix | Precision (0 / 1) | Recall (0 / 1) | F1-score (0 / 1) |
|---|---|---|---|---|---|
| LogisticRegression() | 0.8120 | [[1078 337] [188 1189]] | 0.85 / 0.78 | 0.76 / 0.86 | 0.80 / 0.82 |
| SVC() | 0.8331 | [[1117 298] [168 1209]] | 0.87 / 0.80 | 0.79 / 0.88 | 0.83 / 0.84 |
| GaussianNB() | 0.7941 | [[1075 340] [235 1142]] | 0.82 / 0.77 | 0.76 / 0.83 | 0.79 / 0.80 |
| DecisionTreeClassifier() | 0.7872 | [[1089 326] [268 1109]] | 0.80 / 0.77 | 0.77 / 0.81 | 0.79 / 0.79 |
| KNeighborsClassifier(n_neighbors=18) | 0.7951 | [[1018 397] [175 1202]] | 0.85 / 0.75 | 0.72 / 0.87 | 0.78 / 0.81 |
| RandomForestClassifier() | 0.8535 | [[1197 218] [191 1186]] | 0.86 / 0.84 | 0.85 / 0.86 | 0.85 / 0.85 |
| ExtraTreesClassifier() | 0.8385 | [[1187 228] [223 1154]] | 0.84 / 0.84 | 0.84 / 0.84 | 0.84 / 0.84 |
Cross Validation¶
from sklearn.model_selection import cross_val_score
model = [LogisticRegression(),
         SVC(),
         GaussianNB(),
         DecisionTreeClassifier(),
         KNeighborsClassifier(n_neighbors=18),
         RandomForestClassifier(),
         ExtraTreesClassifier()]
for m in model:
    score = cross_val_score(m, X, Y, cv=5)
    print('\n')
    print('\033[1m' + 'Cross Validation Score', m, ':' + '\033[0m\n')
    print("Score :", score)
    print("Mean Score :", score.mean())
    print("Std deviation :", score.std())
    print('\n')
    print('============================================================================================================')
Cross Validation Score LogisticRegression() :
Score : [0.73777539 0.7485223 0.80763031 0.80763031 0.8155914 ]
Mean Score : 0.7834299399675282
Std deviation : 0.033192035622825765

Cross Validation Score SVC() :
Score : [0.75067168 0.75013434 0.76786674 0.76732939 0.76451613]
Mean Score : 0.7601036556828621
Std deviation : 0.008003700479845513

Cross Validation Score GaussianNB() :
Score : [0.74314884 0.75174637 0.78828587 0.80601827 0.80860215]
Mean Score : 0.7795603011446037
Std deviation : 0.027272687420731516

Cross Validation Score DecisionTreeClassifier() :
Score : [0.71843095 0.75174637 0.83127351 0.83073616 0.82258065]
Mean Score : 0.7909535282799743
Std deviation : 0.04691558087085482

Cross Validation Score KNeighborsClassifier(n_neighbors=18) :
Score : [0.75980656 0.77485223 0.79043525 0.79742074 0.79569892]
Mean Score : 0.783642740346559
Std deviation : 0.014330107182171665

Cross Validation Score RandomForestClassifier() :
Score : [0.77431488 0.80010747 0.88285868 0.88823213 0.89623656]
Mean Score : 0.8483499448209715
Std deviation : 0.05076040993695079

Cross Validation Score ExtraTreesClassifier() :
Score : [0.75389575 0.77700161 0.88393337 0.8866201 0.88924731]
Mean Score : 0.8381396289427006
Std deviation : 0.059823582410026624
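The mean cross-validation scores printed above can be collected into a small Series for side-by-side comparison (values copied from the output, rounded to 4 decimal places):

```python
import pandas as pd

# Mean 5-fold cross-validation scores from the runs above
cv_means = pd.Series({
    "LogisticRegression": 0.7834,
    "SVC": 0.7601,
    "GaussianNB": 0.7796,
    "DecisionTreeClassifier": 0.7910,
    "KNeighborsClassifier(18)": 0.7836,
    "RandomForestClassifier": 0.8483,
    "ExtraTreesClassifier": 0.8381,
})
print(cv_means.sort_values(ascending=False))
```

RandomForestClassifier comes out on top, which motivates choosing it as the final model below.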
Hyperparameter Tuning: GridSearchCV¶
from sklearn.model_selection import GridSearchCV
parameter = { 'max_depth': [5, 10,20,40,50,60],
'criterion':['gini','entropy']}
GCV = GridSearchCV(DecisionTreeClassifier(),parameter)
GCV.fit(X_train,Y_train)
GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 20, 40, 50, 60]})
GCV.best_params_
{'criterion': 'gini', 'max_depth': 10}
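After `GridSearchCV.fit`, the refit winner is available directly as `best_estimator_`, so the winning parameters need not be retyped by hand. A sketch with hypothetical demo data (not the churn features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical demo data
X_demo, y_demo = make_classification(n_samples=400, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=4)

gcv = GridSearchCV(DecisionTreeClassifier(random_state=4),
                   {"max_depth": [5, 10, 20], "criterion": ["gini", "entropy"]})
gcv.fit(X_tr, y_tr)

best_tree = gcv.best_estimator_  # already refit on the full training set
print(gcv.best_params_, round(best_tree.score(X_te, y_te), 3))
```

Using `best_estimator_` keeps the evaluated model consistent with `best_params_`.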
dtc1=DecisionTreeClassifier(max_depth=40,criterion="entropy")
dtc1.fit(X_train,Y_train)
y_pred=dtc1.predict(X_test)
print(classification_report(Y_test, y_pred))
precision recall f1-score support
0 0.78 0.80 0.79 1382
1 0.80 0.78 0.79 1410
accuracy 0.79 2792
macro avg 0.79 0.79 0.79 2792
weighted avg 0.79 0.79 0.79 2792
Final Model¶
Final_mod = RandomForestClassifier(bootstrap=True,criterion='entropy',n_estimators= 60, max_depth=10 ,max_features='sqrt')
Final_mod.fit(X_train,Y_train)
y_pred=Final_mod.predict(X_test)
print('\033[1m'+'Accuracy Score :'+'\033[0m\n', accuracy_score(Y_test, y_pred))
Accuracy Score :
0.8531518624641834
# Lets plot confusion matrix for FinalModel
Matrix = confusion_matrix(Y_test, y_pred)
x_labels = ["NO","YES"]
y_labels = ["NO","YES"]
fig , ax = plt.subplots(figsize=(5,5))
sns.heatmap(Matrix, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax,
cmap="plasma", xticklabels = x_labels, yticklabels = y_labels)
plt.xlabel("Predicted Label",fontsize=14,fontweight='bold')
plt.ylabel("True Label",fontsize=14,fontweight='bold')
plt.title('Confusion Matrix for Final Model',fontsize=20,fontweight='bold')
plt.show()
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(Final_mod, X_test, Y_test)
plt.legend(prop={'size':11}, loc='lower right')
plt.title('ROC Curve of Final Model',fontsize=20,fontweight='bold')
plt.show()
auc_score = roc_auc_score(Y_test, Final_mod.predict(X_test))
print('\033[1m'+'Auc Score :'+'\033[0m\n',auc_score)
Auc Score :
0.8536414749121739
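The AUC above is computed from hard 0/1 predictions; `roc_auc_score` can also consume the positive-class probabilities from `predict_proba`, which uses the full ranking information and typically reflects the ROC curve better. A sketch on synthetic data (names like `X_demo` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration
X_demo, y_demo = make_classification(n_samples=600, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

clf = RandomForestClassifier(random_state=5).fit(X_tr, y_tr)
auc_hard = roc_auc_score(y_te, clf.predict(X_te))               # from 0/1 labels
auc_soft = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])   # from scores
print(round(auc_hard, 3), round(auc_soft, 3))
```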
Saving model¶
import joblib
joblib.dump(Final_mod,'Customer_Churn_Final.pkl')
['Customer_Churn_Final.pkl']
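A quick round-trip check confirms the saved model can be restored and reproduces its predictions; the filename and synthetic data here are hypothetical:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical demo model and data
X_demo, y_demo = make_classification(n_samples=200, random_state=6)
model = RandomForestClassifier(random_state=6).fit(X_demo, y_demo)

path = os.path.join(tempfile.gettempdir(), "churn_model_demo.pkl")
joblib.dump(model, path)          # serialize to disk
restored = joblib.load(path)      # deserialize

# The restored model reproduces the original predictions exactly
print((restored.predict(X_demo) == model.predict(X_demo)).all())
```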
Predicting with the Final Model¶
# Prediction
prediction = Final_mod.predict(X_test)
Actual = np.array(Y_test)
df_Pred = pd.DataFrame()
df_Pred["Predicted Values"] = prediction
df_Pred["Actual Values"] = Actual
df_Pred.head()
| Predicted Values | Actual Values | |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |